Mountain Bike (MTB) Categorization Analysis

Introduction

Project Overview

For this project, our team will determine whether the specifications of mountain bikes (MTB) are enough to differentiate between the different mountain bike categories.

Currently, full suspension mountain bikes come in multiple categories:

  • Cross Country (XC) | Tend to be the most lightweight, nimble, and designed to put the rider in an efficient pedaling position
  • Enduro (EN) | Heavier frames, more travel and more downhill oriented geometry
  • Trail (TR) | The most common category of bikes, considered to be the halfway point between XC and Enduro
  • All Mountain (AM) | A more niche category which some manufacturers claim to be more downhill focused than trail bikes, but not designed for downhill races like Enduro bikes are
  • Downcountry (DC) | A relatively new category between XC and Trail. Similar to the All Mountain category, these bikes aren’t race specific like XC bikes tend to be, but are lighter and faster than trail bikes.

With all of the factors to consider when designing a bike, there are no clear boundaries between these categories. For example, one brand’s Downcountry bike could be what another brand considers a Trail bike. The popular mountain biking website PinkBike has done in-depth analyses of many bikes across all categories, covering both which category a given bike should be classified as and how many categories are sufficient, as seen in the video here.

The goal of our project is to determine how many, if any, discrete categories should exist for mountain bikes. Since most specifications and geometric measurements move in one direction across the spectrum of bikes, it’s reasonable to believe that these measurements could be reduced to far fewer dimensions, perhaps even a single continuous principal component rather than discrete categories. We can also cluster the bikes together based on some of these specifications and geometric measurements.

As an example, here is a diagram of some of the different types of geometric specifications on mountain bikes:

Various Dimension Features of a Bike’s Geometry

The Data

The data was retrieved manually from each of the mountain bike company’s websites. Let’s take a look at the data.

# Read in the first sheet of our data
mtb_data <- read_excel(here::here('Data/mtb_stats.xlsx'), 'Sheet1')
mtb_data <- mtb_data %>% 
  # Clean up the label column
  mutate(label = str_replace_all(str_to_lower(label), '[:punct:]', ''),
         # Create a feature for the long-version of the names
         bike_category = case_when(
          label == 'tr' ~ 'Trail',
          label == 'xc' ~ 'Cross Country',
          label == 'dc' ~ 'Downcountry',
          label == 'am' ~ 'All Mountain',
          label == 'en' ~ 'Enduro',
          TRUE ~ 'red' # Sentinel value that flags any unexpected labels
        ))

# Pull out the class labels
labels <- mtb_data %>% 
  select(label)


# Let's view the mtb_data output
# In any kable outputs, display NAs as blanks
opts <- options(knitr.kable.NA = "")

mtb_data %>% 
  drop_na('url') %>% 
  head(25) %>%
  # Fix up the headers by replacing the underscores with spaces
  rename_all(~ str_replace_all(., "_", " ")) %>% 
  # Make everything proper capitalization
  rename_all(str_to_title) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = F,
                font_size = 12) %>%
  # Make the header row bold and black so it's easier to read
  row_spec(0, bold = T, color = "black") %>% 
  scroll_box(height = "400px", width = "100%")
(Table: the first 25 rows of mtb_data — model, brand, build, price, URL, image, setting, size used, label, rear/fork travel, front/rear piston counts and rotor dimensions, head angle, seat angle, crank length, stem length, handlebar width, reach, stack, wheelbase, chainstay length, BB height, standover height, and the derived bike category.)

EDA

In this section, we’ll take a look at the 74 mountain bikes in our dataset and some of the 27 features. We’ll try to break down our understanding of the data in terms of label, our target variable that acts as the category for each mountain bike.

Label (Mountain Bike Category)

As stated earlier, there are 5 mountain bike categories in our dataset:

  1. Cross Country (xc)
  2. Enduro (en)
  3. Trail (tr)
  4. All Mountain (am)
  5. Downcountry (dc)

Let’s look at how many of each we have in our dataset.

mtb_data %>% 
  group_by(bike_category) %>% 
  tally() %>% 
  arrange(desc(n)) %>% 
  # Start our visualization, grouping by bike category
  ggplot(aes(y = forcats::fct_reorder(bike_category, n), x = n)) +
  geom_col(fill = "slateblue", na.rm = T) +
  # Add a label by recreating our data build from earlier
  geom_label(aes(label = n),
             size = 5,
             # Scooch the labels over a smidge
             hjust = .25) +
  # Let's change the names of the axes and title
  xlab("Number of Bikes") +
  ylab("Category (label)") +
  labs(title = "Number of Mountain Bikes per Category")

We see that out of our 74 bikes, most are Trail bikes, while All Mountain bikes make up the smallest group.

Categorical Variables

There are 4 categorical variables we’ll take a look at to better understand our data:

  1. Setting
  2. Size
  3. Front Piston (f_piston)
  4. Rear Piston (r_piston)
mtb_data %>% 
  select(-label, -bike_category) %>% 
  DataExplorer::plot_bar(ggtheme = theme_classic(),
                         title = 'Distribution of Categorical Variables',
                         theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                            color = "slateblue4", 
                                                                            size = 24),
                                                  plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                  plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                  axis.title.x = element_text(size = 14),
                                                  axis.title.y = element_text(size = 14)),
                         maxcat = 15,
                         ncol = 2)

  • We see that only a few bikes have a setting value, a feature that allows a rider to slightly adjust the frame’s geometry to fine-tune rider comfort. Later on, we’ll group the settings for the same bike and average the results to get a more accurate representation of each bike’s specs.
  • Most of the bikes analyzed have 4 rear/front pistons. The two variables seem to be perfectly in-sync, leading us to believe that they’re highly correlated.

But, really, we care about understanding how these different variables interact with our target variable, label. Let’s look at their distribution and look for any patterns.

mtb_data %>% 
  select(-label) %>% 
  DataExplorer::plot_bar(ggtheme = theme_classic(),
                         by = 'bike_category',
                         by_position = 'fill',
                         title = 'Distribution of Categorical Variables',
                         theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                            color = "slateblue4", 
                                                                            size = 24),
                                                  plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                  plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                  axis.title.x = element_text(size = 14),
                                                  axis.title.y = element_text(size = 14)),
                         maxcat = 15,
                         ncol = 2)

Here we see:

  • The size used for most of the bikes is pretty evenly distributed. For the most part, we attempted to find bikes sized to the heights of the authors of this report (approx. 5’8”-5’11”), which tended to be Large-sized bikes; for some bikes, however, the specific company website from which we pulled the data recommended a Medium-sized bike.
  • Although most of the bikes have 4-piston brakes, of the bikes that have 2 pistons, most are Cross Country (xc) bikes. 4-piston brakes are known to have higher stopping power which is more important the more the rider intends to ride downhill. However, they come at the cost of additional weight, which most XC riders will avoid at all costs.

Continuous Variables

To analyze the continuous features within our dataset, we built density plots for each of them to better understand their distribution.

DataExplorer::plot_density(mtb_data,
                             ggtheme = theme_classic(),
                             title = 'Distribution of Continuous Variables',
                             geom_density_args = list(fill = 'slateblue'),
                             theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                                color = "slateblue4", 
                                                                                size = 24),
                                                      plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                      plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                      axis.title.x = element_text(size = 14),
                                                      axis.title.y = element_text(size = 14)),
                             ncol = 3)

~Normally Distributed Variables:

  • Chainstay_length
  • Fork_travel
  • Bb_height
  • Seat_angle

Skewed Variables:

  • Head_angle (skewed right)
  • Handlebar_width (skewed left)
  • Wheelbase (skewed left)

Multi-Modal Distributed Variables:

  • f_rotor_dim / r_rotor_dim
  • Stem_length

Like we did for the categorical variables, let’s look at the distribution of each of these continuous predictors by our target variable, label, to look for any discernible patterns.

mtb_data %>% 
  DataExplorer::plot_boxplot(by = 'label',
                             geom_boxplot_args = list('fill' = 'slateblue'),
                           ggtheme = theme_classic(),
                           theme_config = theme(plot.title = element_text(hjust = 0, 
                                                                          color = "slateblue4", 
                                                                          size = 24),
                                                plot.subtitle = element_text(hjust = 0, color = "slateblue2", size = 10),
                                                plot.caption = element_text(color = "dark gray", size = 10, face = "italic"),
                                                axis.title.x = element_text(size = 14),
                                                axis.title.y = element_text(size = 14)),
                           ncol = 3)

Here we see:

  • Cross Country (xc) bikes tend to have the largest head angle and smallest seat angle compared to other bikes. They also have the largest stem length by a significant margin. Overall, Cross Country bikes tend to be the most differentiable from other bike categories;
  • All Mountain (am) bikes have a significantly smaller standover height and, along with Enduro (en) bikes, have a much larger reach than other bike categories;
  • As is generally expected, Trail (tr) bikes tend to fall in the middle for most of these continuous variables. This makes sense given that they tend to split the difference between Cross Country and Enduro bikes.

Average bikes by flip-chip setting

# Split data based on setting vs. no setting
no_setting <- mtb_data %>% 
  filter(is.na(setting))
setting <- mtb_data %>% 
  filter(!is.na(setting))



# Keep the model and label columns alongside all numeric columns
setting <- bind_cols(select(setting, model, label),
                     select_if(setting, is.numeric))

mean_by_setting <- aggregate(x=select(setting, -c(model, label)),
                             by=list(setting$model, setting$label),
                             FUN=mean)
# Restore the grouping columns' original names
mean_by_setting <- mean_by_setting %>% 
  rename(model = Group.1, label = Group.2)

# Same trimming for the bikes without a setting value
no_setting <- bind_cols(select(no_setting, model, label),
                        select_if(no_setting, is.numeric))

new_mtb_data <- data.frame(rbind(mean_by_setting, no_setting))

rownames(new_mtb_data) <- new_mtb_data$model

rm(no_setting)
rm(mean_by_setting)

Because some bikes’ websites list two different “settings” for the same-sized bike, we opted to include both options and average them together to get one middle-of-the-road estimate for that bike. We end up performing this operation for 47% of the bikes in our dataset.
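As a minimal sketch of this averaging step (toy rows with made-up values, not our actual data):

```r
# Toy example of averaging two flip-chip settings into one row per model
bikes <- data.frame(
  model      = c("spire", "spire", "ripley"),
  label      = c("en",    "en",    "tr"),
  head_angle = c(63.0,    62.5,    66.55)
)

# One row per model/label, with the numeric spec averaged across settings
avg <- aggregate(head_angle ~ model + label, data = bikes, FUN = mean)
avg$head_angle[avg$model == "spire"]  # 62.75
```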


Methodology

Now that we have a better understanding of our mountain bike dataset, we’ll formulate a plan to prove the following hypothesis:

Applying our own clustering algorithms will either give us a different number of clusters (rather than the 5 pre-ordained categories) OR will not produce clearly defined clusters, leading us to believe that the bikes actually lie on a spectrum and cannot be grouped into one of the 5 pre-ordained categories.

To do so, we’ll:

  • Try to use various methods to reduce the featureset and see if there are certain variables that can better be used to differentiate between different mountain bike categories.

  • Apply various clustering and classification algorithms, including K-Means Clustering, Gaussian Mixture Models, and Multi-class Support Vector Machine, to disprove the notion that 5 distinct categories of Mountain Bikes exist.


Variation Amongst Featureset

The first thing we’ll do is look to see if any of the features in our dataset are better at explaining the variation amongst the different bikes than other features. That is, it’s completely possible that two features are similar and don’t have much variation in them, even across some of the different bike categories. To do so, we’ll:

  1. Look for highly correlated features and flag these for potential removal;
  2. Run Principal Component Analysis (PCA) to see if certain features are better at explaining the variation in our data than others.

1. Correlation

First, let’s take a look at our most highly correlated features. We’ll use the corrplot() function to better order the highly correlated features by the angular order of their eigenvectors.

mtb_correlation <- mtb_data %>% 
  # Get rid of price for now
  select(-price) %>% 
  # Select our variables of interest
  select_if(is.numeric) %>% 
  # Build our correlation matrix, such that missing values are handled by casewise deletion
  cor(use = 'complete.obs') 

# Convert our results into a tibble for easier manipulation
mtb_correlation_df <- mtb_correlation %>% 
  as_tibble() %>% 
  mutate(variable = colnames(mtb_correlation)) %>% 
  relocate(variable, everything())

# Build our correlation plot, using the angular order of the eigenvectors
corrplot(mtb_correlation,
         diag = F,
         col = COL2('PRGn'),
         tl.col = 'slateblue4',
         type = 'lower',
         method = 'color',
         order = 'AOE',
         title = 'Mountain Bike Feature Correlation'
         )

Here we see some obvious correlations, for example:

  • f_piston (front brakes) is perfectly correlated with r_piston (rear brakes), which makes sense since mountain bikes tend to use the same type/spec of brakes on the front and rear wheels.
  • fork_travel has correlations above .9 (in absolute value) with rear_travel, wheelbase, and head_angle. This makes sense; for example, rear_travel should be highly correlated with fork_travel.
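The perfect f_piston/r_piston correlation is easy to sanity-check with a toy example (the values below are illustrative, not from our dataset):

```r
# If the front and rear piston counts always match, cor() returns exactly 1
f_piston <- c(2, 4, 4, 2, 4)
r_piston <- c(2, 4, 4, 2, 4)
cor(f_piston, r_piston)  # 1
```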

In all, here are the most highly correlated variable pairs (i.e. pairs with a correlation above .9 or below -.9):

mtb_correlation_df %>% 
  pivot_longer(-variable, 
               names_to = 'correlated_variable', 
               values_to = 'correlation') %>% 
  filter(variable != correlated_variable) %>% 
  # Sort by the absolute value of correlation
  arrange(desc(abs(correlation))) %>% 
  filter((correlation > .90) | (correlation < -.90)) %>% 
  # Get rid of duplicative rows
  dplyr::distinct(correlation, .keep_all = T) %>% 
  pander()
variable      correlated_variable   correlation
------------  --------------------  -----------
f_piston      r_piston              1
rear_travel   fork_travel           0.9608
rear_travel   wheelbase             0.9301
rear_travel   head_angle            -0.9219
fork_travel   wheelbase             0.9195
fork_travel   head_angle            -0.9193
head_angle    seat_angle            -0.9031

That’s a solid number of highly correlated pairs, especially given that we only have 18 continuous columns in our dataset! For now, we’ll opt to include everything. But later on, as we analyze the importance of different features, we’ll look to remove some of the above variables first.
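When we do get to pruning, one way to flag one variable from each highly correlated pair is a greedy pass over the correlation matrix. Below is a hedged base-R sketch (a simplified stand-in for helpers like caret::findCorrelation; the toy data is made up):

```r
# Toy data: a and b are perfectly correlated, c is roughly independent
cm <- cor(data.frame(a = 1:10,
                     b = (1:10) * 2,                     # perfectly correlated with a
                     c = c(5, 1, 4, 2, 6, 3, 7, 0, 8, 2)))  # ~uncorrelated

# Greedily keep the first variable of each highly correlated pair, flag the second
flag <- c()
for (i in seq_len(ncol(cm) - 1)) {
  for (j in seq(i + 1, ncol(cm))) {
    if (abs(cm[i, j]) > 0.9 && !(colnames(cm)[i] %in% flag)) {
      flag <- c(flag, colnames(cm)[j])
    }
  }
}
flag  # candidates for removal -> "b"
```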

2. Principal Component Analysis (PCA)

Next, we’ll apply PCA to our dataset. In so doing, we’ll have to center and scale our data given how different the ranges are for certain measurements. Let’s take a look at our 5 principal components which explain the largest proportion of variance in the data:

# Impute missing values with column mean (not really best practice, but good enough)
for (c in 1:ncol(new_mtb_data)){
  if (is.numeric(unlist(new_mtb_data[,c]))){
    # print(colnames(new_mtb_data)[c])
    new_mtb_data[is.na(new_mtb_data[,c]), c] <- mean(unlist(new_mtb_data[,c]), na.rm=TRUE)  
  }
}

mtb_no_null <- new_mtb_data %>% 
                select(-price) %>%
                select_if(is.numeric) %>% 
                bind_cols(label = new_mtb_data$label) %>%
                drop_na()

mtb_pca <- prcomp(mtb_no_null %>% select(-label),
                  center = TRUE,
                  scale. = TRUE)

# Put our summary results into a dataframe
mtb_pca_df <- tibble(variable = c('Standard Deviation', 'Proportion of Variance', 'Cumulative Proportion')) %>% 
  cbind(summary(mtb_pca)$importance)


mtb_pca_df %>% 
  # Only display the first 5 principal components
  select(c(PC1:PC5)) %>% 
  pander()
                        PC1     PC2      PC3      PC4      PC5
Standard deviation      3.024   1.262    1.164    1.071    0.8761
Proportion of Variance  0.538   0.09369  0.07977  0.06745  0.04515
Cumulative Proportion   0.538   0.6317   0.7115   0.7789   0.8241
mtb_pca_df %>% 
  # Pivot our data so it's easier to visualize
  pivot_longer(-variable, 
               names_to = 'PC',
               names_prefix = 'PC') %>% 
  # Make the principal component column an integer so ggplot orders it from 1:17 properly
  mutate(PC = as.integer(PC),
         # Convert value to % (multiply by 100) so it's not a decimal
         value = 100*value) %>% 
  filter(variable == 'Proportion of Variance') %>% 
  ggplot(aes(x = PC, y = value)) +
  geom_point(size = 4, color = 'slateblue') +
  geom_line(alpha = .6, lwd = 2, color = 'slateblue') + 
  labs(title = 'Proportion of Variance Explained by Principal Components',
       x = 'Principal Component',
       y = 'Proportion of Variance (%)')

We can see that our 1st principal component alone explains more than half of the variance in our data. The scree plot shows a distinguishable elbow: after the huge drop-off from the 1st to the 2nd principal component, each additional component adds relatively little. By the 5th principal component, 82.4% of the data’s variation is explained. This leads us to believe that the majority of the variation in our data can be explained by just 1 principal component!
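The proportion-of-variance numbers above come straight from the PCA standard deviations; a quick sketch of the calculation (using the built-in mtcars data as a stand-in for our bike features):

```r
# How the Proportion/Cumulative of Variance rows are derived from prcomp output
pca      <- prcomp(mtcars, center = TRUE, scale. = TRUE)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # each PC's share of total variance
cum_var  <- cumsum(prop_var)              # running total, as reported in the table
```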

Let’s take a look at how our top 2 principal components explain the 5 different mountain bike categories:

p_load(devtools,
       ggbiplot)

# jpeg('../Images/pca.jpg')

ggbiplot(mtb_pca,
              obs.scale = 1,
              var.scale = 1,
              groups = mtb_no_null$label,
              ellipse = TRUE,
              circle = FALSE,
              ellipse.prob = .5) + 
  theme(legend.direction = 'horizontal',
               legend.position = 'top')

Here we can see that our top 2 principal components, which explain roughly 63.2% of the variation in our data, already do a reasonable job of describing the dataset. The groupings are distinctly plotted on the 2-D graph, and it is pretty easy to see how the different bike categories (denoted by color) can be explained using a linear transformation of our existing data.


Clustering

Because we are investigating the validity of mountain bike categories, one approach is to treat this dataset as unsupervised, stripping the bikes of their label and seeing if various clustering algorithms can re-create the 5 distinct labels. To do so, we’ll take a look at the following algorithms:

  • K-Means
  • Gaussian Mixture Models (GMM)
  • Support Vector Machine (SVM)

K-Means

We’ll start by using the K-Means Clustering algorithm, looking at various numbers of clusters (k) and seeing if the bikes logically group together.

# How many clusters are necessary? 4?

mtb_numeric <- mtb_no_null %>% 
  select(-label)
mtb_standard_scaled <- scale(mtb_numeric)

clusters <- 1:10
dists <- c()
for (c in 1:10){
  km <- kmeans(mtb_standard_scaled, centers=c, iter.max=1000)
  dists <- c(dists, km$tot.withinss)
}

# jpeg('../Images/Kmeans.jpg')
# plot(clusters, dists, type='l', xlab='Clusters', ylab='Total Sum of Squared Euclidean Distances')

# Plot our results
tibble(clusters = clusters,
       dists = dists) %>% 
  ggplot(aes(x = clusters, y = dists)) + 
  geom_point(size = 4, color = 'slateblue') +
  geom_line(alpha = .6, lwd = 2, color = 'slateblue') + 
  labs(title = "K-Means Clustering of MTB Data",
       subtitle = 'Method uses `tot.withinss` parameter to measure distances.',
       x = 'Clusters',
       y = 'Total Sum of Squared Euclidean Distances')

# Let's see where these clusters would end up on the 2D PCA plot
mtb_pca_scaled <- prcomp(mtb_standard_scaled,
                  center = F,
                  scale. = F)

pca_2_scaled <- as.matrix(mtb_standard_scaled) %*% as.matrix(mtb_pca_scaled$rotation[,1:2])

pca_km_scaled <- kmeans(pca_2_scaled, centers=3, iter.max=1000)


# Bring our PCA and k-means clusters results into our dataset
new_mtb_data %>% 
  cbind(pca_2_scaled) %>% 
  mutate(# Create a feature for the long-version of the names
         bike_category = case_when(
          label == 'tr' ~ 'Trail',
          label == 'xc' ~ 'Cross Country',
          label == 'dc' ~ 'Downcountry',
          label == 'am' ~ 'All Mountain',
          label == 'en' ~ 'Enduro',
          TRUE ~ 'red'
        ),
        # Bring our clusters in as a factor
        cluster = as.factor(pca_km_scaled$cluster)) %>% 
  # Plot the clusters over the first two principal components
  ggplot() +
    geom_point(aes(x = PC1, 
                   y = PC2, 
                   color = cluster, 
                   shape = bike_category), 
               alpha = .9, 
               size = 3) +
    # Add our cluster centers in as well
    geom_point(data = as_tibble(pca_km_scaled$centers) %>%
                 mutate(cluster = as.factor(c(1, 2, 3))), 
               aes(x = PC1, 
                   y = PC2,
                   color = cluster), 
               shape = 10, 
               size = 7) + 
  # Color clusters accordingly
    scale_color_manual(values = c('slateblue4', 'gray', 'slateblue1'), name = 'Cluster') +
    labs(title = "K-Means Clustering of MTB Principal Components",
       subtitle = 'Assigned clusters denoted by color;\nBike categories denoted by shape;\nCluster centers denoted by large cross-hairs shape.',
       x = 'Principal Component 1',
       y = 'Principal Component 2')

# Note: the two Niner bikes stand out at the bottom of the plot (see the note below on reach)
# Note: based on the PCA mapping, the Blur TR, Epic, Exie, Ripley, and Element all have shorter
# chainstays and fewer pistons, raising the question of whether piston count should be excluded;
# with more 2-piston bikes added to the dataset, the average evens out and these bikes no longer
# stand out as much

Above, we graphed the 3 clusters created using the top 2 principal components of our data. For example, we can see Cluster #1 on the right-hand side of the chart, mostly composed of Cross Country bikes (triangles in the chart) and some Downcountry bikes (denoted by squares). Downcountry bikes also seem to be part of Cluster #2 (gray points), along with Trail bikes (denoted by squares with an ‘x’ in them) and some Enduro bikes (denoted by ‘+’). However, Trail bikes also feature heavily in Cluster #3 along with most of the Enduro bikes.

Overall, it’s clear that there is significant overlap between our clusters, mainly along the Principal Component 1 axis, lending credence to the notion that the bikes can be differentiated along a single, continuous scale.
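We did not compute this for the report, but one common way to quantify such overlap is the average silhouette width: values near 1 indicate well-separated clusters, values near 0 indicate heavy overlap. A base-R sketch on simulated 2-D data (not our bike dataset):

```r
# Average silhouette width, computed by hand on toy data
set.seed(42)
x  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
            matrix(rnorm(40, mean = 3), ncol = 2))
km <- kmeans(x, centers = 2, iter.max = 100)
d  <- as.matrix(dist(x))
n  <- nrow(x)

sil <- sapply(seq_len(n), function(i) {
  own <- km$cluster == km$cluster[i]
  a <- mean(d[i, own & seq_len(n) != i])                # avg distance within own cluster
  b <- min(tapply(d[i, !own], km$cluster[!own], mean))  # avg distance to nearest other cluster
  (b - a) / max(a, b)
})

mean(sil)  # near 1 = well separated; near 0 = overlapping
```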

Note: In the bottom-right of the graph (PC2 < -4), we see two Niner bikes, almost acting as outliers. For a 5’10” rider Niner suggests a size Medium, which results in low reach numbers on its bikes. From the earlier PCA plot, we see that Reach heavily corresponds with PC2, and thus these bikes appear lower on the visual.

Gaussian Mixture Model (GMM)

In this section, we’ll take a more probabilistic approach to our clustering. That is, we’ll use a Gaussian Mixture Model (GMM) to build normally distributed subgroupings within our mountain bike dataset, where the density of each subgrouping represents the probability that a bike belongs to it. Unlike K-Means, which is a centroid-based clustering method, GMM is a distribution-based clustering method.

Generally, what we expect to see is something like the following:

where, given a specific bike’s measurements, we can estimate the probability \(p(x)\) that the bike belongs to a category like Cross Country (xc) vs. Trail vs. Enduro.
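Concretely, the density a GMM fits is a weighted sum of \(K\) Gaussian components:

\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \]

where \(\pi_k\) is the mixing weight of component \(k\), and \(\mu_k\) and \(\Sigma_k\) are its mean and covariance. The fitted component responsibilities give each bike’s membership probabilities.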

We’ll run the Optimal_Clusters_GMM() function from the ClusterR package in R to figure out an optimal number of clusters. It uses the expectation-maximization algorithm to perform the probabilistic clustering, and it evaluates the Bayesian Information Criterion (BIC) for each candidate number of clusters; lower BIC values indicate a better trade-off between fit and model complexity.
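For reference, the BIC is defined in terms of the number of free parameters \(k\), the maximized likelihood \(\hat{L}\), and the number of observations \(n\):

\[ \text{BIC} = k\ln(n) - 2\ln(\hat{L}) \]

The \(k\ln(n)\) term penalizes complexity, so adding mixture components must improve the likelihood enough to pay for their extra parameters.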

p_load(ClusterR)


opt_gmm <- Optimal_Clusters_GMM(mtb_standard_scaled, 
                     max_clusters = 10, 
                     criterion = "BIC", 
                     dist_mode = "eucl_dist", 
                     seed_mode = "random_subset",
                     km_iter = 10, 
                     em_iter = 10, 
                     var_floor = 1e-10, 
                     verbose = T,
                     plot_data = T)
## iteration: 1  num-clusters: 1
## iteration: 2  num-clusters: 2
## iteration: 3  num-clusters: 3
## iteration: 4  num-clusters: 4
## iteration: 5  num-clusters: 5
## iteration: 6  num-clusters: 6
## iteration: 7  num-clusters: 7
## iteration: 8  num-clusters: 8
## iteration: 9  num-clusters: 9
## iteration: 10  num-clusters: 10

(Plot: BIC value for each candidate number of clusters, produced by Optimal_Clusters_GMM)

From the plot above, we see that the BIC value generally decreases as the number of clusters increases. However, the first big drop occurs at clusters = 3. Let’s try that value and see our results.

# Build our GMM model
mtb_gmm <- GMM(mtb_standard_scaled,
               gaussian_comps = 3,
               dist_mode = 'maha_dist', # Distance metric to use during seeding of initial means clustering
               seed_mode = 'random_subset', # How initial means are seeded prior to EM alg
               km_iter = 10, # Num of iterations of K-Means alg
               em_iter = 10, # Num of iterations of EM alg
               verbose = F
               )

# Run our predictions and convert it to a dataframe
mtb_gmm_pred <- predict_GMM(mtb_standard_scaled, 
                            mtb_gmm$centroids, 
                            mtb_gmm$covariance_matrices, 
                            mtb_gmm$weights) 

mtb_gmm_pred <- bind_cols(data.frame(mtb_gmm_pred$log_likelihood),
                       data.frame(mtb_gmm_pred$cluster_proba),
                       mtb_gmm_pred$cluster_labels) %>% 
  as_tibble()

# Rename our columns
names(mtb_gmm_pred) <- c('log_likelihood_c1', 
                         'log_likelihood_c2', 
                         'log_likelihood_c3',
                         'cluster1_prob',
                         'cluster2_prob',
                         'cluster3_prob',
                         'cluster_labels')

mtb_gmm_pred %>% 
  select(cluster_labels) %>% 
  bind_cols(label = mtb_no_null$label) %>% 
  # xtabs(~ ., data = .) %>% 
  table() %>% 
  bind_cols(cluster_labels = c('1', '2', '3')) %>% 
  relocate(cluster_labels, everything()) %>% 
  pander()


cluster_labels   am   dc   en   tr   xc
             1    1    0    5   11    1
             2    2    1    9    8    0
             3    0    7    0    2   11

Here we see the predicted cluster labels along with the actual 5 bike categories in our data. Trail and Enduro bikes are mostly grouped into Clusters 1 and 2, while Downcountry and Cross Country bikes are grouped into Cluster 3. This would lead us to believe that the 3 clusters are not fully representative of the 5 labeled categories, and that fewer clusters could suffice.
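To make this concrete, we can recreate the cross-tab above (values copied from the table) and check which labeled category dominates each cluster:

```r
# Cross-tab of GMM cluster vs. actual label, copied from the table above
tab <- matrix(c(1, 0, 5, 11, 1,
                2, 1, 9,  8, 0,
                0, 7, 0,  2, 11),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c('1', '2', '3'),
                              label = c('am', 'dc', 'en', 'tr', 'xc')))

# Dominant label per cluster: tr, en, xc respectively
dominant <- apply(tab, 1, function(r) names(which.max(r)))

# Share of bikes falling in their cluster's dominant category (~53%)
sum(apply(tab, 1, max)) / sum(tab)
```

Only about half of the bikes land in their cluster's dominant category, which is consistent with the overlap we observed in the PCA scatter plot.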

Even so, let’s see how the probability of each bike belonging to a cluster appears by looking at the densities of each of the associated probabilities for a bike belonging to one of the cluster labels.

# Plot the densities for our clusters
cluster1_pred <-  mtb_gmm_pred %>% 
  filter(cluster_labels == 1) %>%
  ggplot(aes(x = cluster1_prob, y=..scaled..)) +
  # Plot the density and adjust the curvature so it looks better
  geom_density(position = 'stack', alpha = .1, fill = 'slateblue', color = 'slateblue', adjust = 10) +
  labs(title = 'Cluster 1',
       x = 'Probability',
       y = 'Density')

  
cluster2_pred <-  mtb_gmm_pred %>% 
  filter(cluster_labels == 2) %>%
  ggplot(aes(x = cluster2_prob, y=..scaled..)) +
  # Plot the density and adjust the curvature so it looks better
  geom_density(position = 'stack', alpha = .1, fill = 'slateblue', color = 'slateblue', adjust = 25) +
  labs(title = 'Cluster 2',
       x = 'Probability',
       y = 'Density')

cluster3_pred <- mtb_gmm_pred %>% 
  filter(cluster_labels == 3) %>%
  ggplot(aes(x = cluster3_prob, y=..scaled..)) +
  # Plot the density and adjust the curvature so it looks better
  geom_density(position = 'stack', alpha = .1, fill = 'slateblue', color = 'slateblue', adjust = 250) +
  labs(title = 'Cluster 3',
       x = 'Probability',
       y = 'Density')

gridExtra::grid.arrange(cluster1_pred, cluster2_pred, cluster3_pred, ncol = 3)

Above, we see the distributions of the predicted probabilities for each cluster assignment. That is, the graph on the left shows how confident the GMM was when assigning bikes to Cluster #1. Generally, the probabilities are all above .95; in other words, the GMM is extremely confident in grouping the bikes into these 3 clusters.

Multi-class SVM

If we treat our labels as truth, then we can approach this analysis as a supervised learning problem. For this section, we chose to group the All Mountain category in with Enduro, since it completely overlapped Enduro on the PCA chart above. We also experimented with three treatments of the Downcountry category: leaving it as a separate category, grouping it with XC, and grouping it with Trail.

We chose to use a Multi-Class Support Vector Machine, and a grid search to tune the kernel functions and \(\gamma\) values. For each set of parameters, we used K-fold cross validation with k=10 on all rows of the data. We decided against holding out data as a test set since we have such limited data, and the K-fold CV should evaluate the model’s performance on blind data.

Using all of the data, the best SVM model was 73% accurate, using a radial basis kernel with \(\gamma=2.595024\).

Treating the Downcountry category as XC, the best model was 81.6% accurate, with a radial basis kernel and \(\gamma=0.02983647\).

Treating the Downcountry category as Trail, the best model was 80.0% accurate, with a radial basis kernel and \(\gamma=3.764936\).

Of course, we expect performance to improve when we merge this category into one of its adjacent categories, but these results suggest that the Downcountry category is slightly more skewed towards the XC bikes.

p_load(e1071,
       caret)

# Convert the All Mountain category to Enduro; keep Downcountry separate
remap <- function(x, num){
  if (x=='am' || x=='en'){
    if (num){
      return(4)
    }
    else{
      return('Enduro')
    }
  }
  else if(x=='xc'){
    if(num){
      return(1)
    }
    else{
      return('Cross Country')
    }
  }
  else if(x=='dc'){
    if(num){
      return(2)
    }
    else{
      return('Downcountry')
    }
  }
  else if(x=='tr'){
    if(num){
      return(3)
    }
    else{
      return('Trail')
    }
  }
}
labels <- as.factor(unlist(lapply(new_mtb_data$label, remap, F)))


trainSVM <- function(x, y, idx, k='radial', g=0){
  # Split into train/test using the held-out fold indices
  xtest <- x[idx,]
  xtrain <- x[-idx,]
  ytest <- y[idx]
  ytrain <- y[-idx]
  
  # The linear kernel has no gamma; otherwise only pass gamma when supplied
  if (k == 'linear' || g == 0){
    clf <- svm(x = xtrain, y = ytrain, kernel = k)
  }
  else{
    clf <- svm(x = xtrain, y = ytrain, kernel = k, gamma = g)
  }
  
  preds <- predict(clf, xtest)
  
  # Accuracy: sum of the confusion matrix diagonal over total predictions
  cm <- table(ytest, preds)
  return(sum(diag(cm)) / sum(cm))
}



folds <- createFolds(labels, k=10)


#Grid Search for SVM
kernels <- c('linear', 'polynomial', 'radial', 'sigmoid')
gammas <- seq(-5, 3, length.out=100)
gammas <- 10^gammas


#Change these below
X = pca_2_scaled
# X = mtb_standard_scaled
y = labels

results <- matrix(ncol=3, nrow=0)
colnames(results) <- c('acc', 'kernel', 'gamma')

for (k in kernels){
  if (k=='linear'){
    
    folds <- createFolds(labels, k=10)
    accs <- c()
    for (fold in folds){
      # print('here')
      acc <- trainSVM(X, y, fold, k='linear', g=0)
      accs <- c(accs, acc)
    }
    results <- rbind(results, c(mean(accs), k, 0))
  }
  else{
    for (g in gammas){
      # print('choosing gamma')
      # print(k)
      folds <- createFolds(labels, k=10)
      accs <- c()
      for (fold in folds){
        acc <- trainSVM(X, y, fold, k=k, g=g)
        accs <- c(accs, acc)
      }
    results <- rbind(results, c(mean(accs), k, g))
    }
  }
}

# Summarize the grid search results
results <- data.frame(results)
results$acc <- as.numeric(results$acc)
results$gamma <- as.numeric(results$gamma)

# Best model overall
results[which.max(results$acc), ]

# Best non-linear model, for comparison
nonlinear_svm <- results[results$kernel != 'linear', ]
nonlinear_svm[which.max(nonlinear_svm$acc), ]

# On the two principal-component axes, the best SVM model is linear

To visualize the SVM, we again mapped all features to the first two principal components. In this scenario, we actually achieved a higher accuracy of 75% using a linear kernel, again treating the Downcountry category as its own distinct category.

## This cell is only to visualize the linear kernel
# (gamma is ignored by the linear kernel, so we omit it)
pcsvm <- svm(x = pca_2_scaled, y = labels, kernel = 'linear')

dat <- data.frame(pca_2_scaled)

grid <- expand.grid(seq(min(dat[, 1]), max(dat[, 1]), length.out = 100),
                    seq(min(dat[, 2]), max(dat[, 2]), length.out = 100))
names(grid) <- names(dat)[1:2]
preds <- predict(pcsvm, grid)
df <- data.frame(grid, preds)

ggplot(df, aes(x = PC1, y = PC2)) + 
  geom_tile(aes(fill = preds), alpha = .2) +
    scale_fill_manual(values = c('slateblue4', 'gray', 'slateblue1', 'green', 'red')) +
  geom_point(data = dat, aes(shape = labels, color = labels), size = 2) + 
  scale_color_manual(values = c('slateblue4', 'gray', 'slateblue1', 'green', 'red')) +
  labs(title = "Support Vector Machine Classification",
         x = 'Principal Component 1',
         y = 'Principal Component 2')

An interesting observation is that most of the boundary lines are more or less vertical, suggesting that most of the variation between classes lies along Principal Component 1. The boundary between XC and Downcountry deviates from this; however, the validity of that boundary is still in question, since the Downcountry category itself is more or less unofficial.

# DC as XC, using Linear
remap2 <- function(x, num){
  if (x=='am' || x=='en'){
    if (num){
      return(4)
    }
    else{
      return('Enduro')
    }
  }
  else if(x=='xc' || x=='dc'){
    if(num){
      return(1)
    }
    else{
      return('Cross Country')
    }
  }
  else if(x=='tr'){
    if(num){
      return(3)
    }
    else{
      return('Trail')
    }
  }
}
labels2 <- as.factor(unlist(lapply(new_mtb_data$label, remap2, F)))

pcsvm2 <- svm(x=pca_2_scaled, y=labels2, kernel='linear')

preds2 <- predict(pcsvm2, grid)
df2 <- data.frame(grid, preds2)

ggplot(df2, aes(x = PC1, y = PC2)) + 
  geom_tile(aes(fill=preds2)) +
  geom_point(data = dat, aes(shape = labels2), size = 2) + 
  labs(title = "Support Vector Machine Classification",
         x = 'Principal Component 1',
         y = 'Principal Component 2')

Mapping all Downcountry bikes to XC, the boundaries become almost entirely vertical, again suggesting that the classification of bikes can be attributed to Principal Component 1.


Conclusions

Findings

  • All results suggest that trying to discretely categorize full suspension mountain bikes is more or less arbitrary.

  • The categorization of a mountain bike should be treated as on a continuous scale, with Cross Country (XC) bikes on one end and Enduro (EN) bikes on another.

  • To obtain where a specific bike lies on this scale, one can take the dot product of the bike’s scaled specifications with the loadings of the first principal component.

  • This new spectrum of mapping bikes can provide bike manufacturers and consumers a method to quantify how a bike will handle when ridden.

Let’s look at an example from the data above. Some bike companies, like Transition and Revel, do not explicitly categorize their bikes like others do. For these brands, we categorized them based on general attributes, as well as media coverage of them.

Revel Ranger specifications:

Rear Travel (mm)              115
Fork Travel (mm)              120
Front Piston (count)            2
Front Rotor Diameter (mm)     180
Rear Piston (count)             2
Rear Rotor Diameter (mm)      160
Head Angle (degrees)         67.5
Seat Angle (degrees)         75.3
Crank Length (mm)             170
Stem Length (mm)               40
Handlebar Width (mm)          780
Reach (mm)                    473
Stack (mm)                    619
Wheelbase (mm)               1194
Chainstay Length (mm)         436
Bottom Bracket Height (mm)    338
Standover Height (mm)         699

Converting to an input vector: \[ \text{Ranger} = \begin{bmatrix} 115 \\ 120 \\ 2 \\ 180 \\ 2 \\ 160 \\ 67.5 \\ 75.3 \\ 170 \\ 40 \\ 780 \\ 473 \\ 619 \\ 1194 \\ 436 \\ 338 \\ 699 \end{bmatrix} \rightarrow \text{ Ranger (scaled)} = \begin{bmatrix} -0.61 \\ -0.80 \\ -1.68 \\ -0.53 \\ -1.68 \\ -1.30 \\ 0.79 \\ -0.39 \\ -0.40 \\-1.15 \\-0.11 \\0.60 \\0.16 \\-0.33 \\0.16 \\-0.16 \\ -1.01 \end{bmatrix} \text{and PC1}=\begin{bmatrix} -0.31 \\ -0.31 \\ -0.23 \\ -0.26\\ -0.23\\ -0.27\\ 0.30\\ -0.26\\ -0.01\\ 0.24\\ -0.27\\ -0.17\\ -0.26\\ -0.30\\ -0.13\\ -0.25\\ 0.08\end{bmatrix} \]

\[ \text{Ranger (scaled)} \cdot \text{PC1} = 1.7 \]

Mapping the Revel Ranger onto the first principal component, we get a value of 1.7, which puts it right around the boundary between Trail and XC. This lines up with the PinkBike editors labeling the Ranger as Downcountry in the video mentioned in the Project Overview section.
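As a sanity check, this projection can be reproduced directly from the two vectors shown above (values copied from the report, rounded to two decimals, so the result is approximate):

```r
# Scaled spec vector and PC1 loadings for the Revel Ranger (from above)
ranger_scaled <- c(-0.61, -0.80, -1.68, -0.53, -1.68, -1.30, 0.79, -0.39,
                   -0.40, -1.15, -0.11, 0.60, 0.16, -0.33, 0.16, -0.16, -1.01)
pc1 <- c(-0.31, -0.31, -0.23, -0.26, -0.23, -0.27, 0.30, -0.26, -0.01,
         0.24, -0.27, -0.17, -0.26, -0.30, -0.13, -0.25, 0.08)

# Dot product = the bike's PC1 score
sum(ranger_scaled * pc1)  # ~1.69, i.e. the 1.7 reported above
```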

Opportunities for Improved Analysis

There are a few opportunities to improve the analysis included in this presentation and forthcoming report:

  • Inclusion of more bikes (rows) | The most obvious improvement we can make is to add more data points to our dataset. The data for each individual bike was manually entered by one of the authors of this report. After entering data for bikes of most major bike brands, we had enough data to accurately visualize the different bike categories; however, with more bikes, our algorithms will become more robust and less affected by the presence of outliers.

  • Inclusion of more bike features (columns) | Although we included the most meaningful specs/geometry of the bikes analyzed, there are dozens of other, smaller features that can be used to help differentiate between different types of bikes.

  • Include all sizes of bikes | We chose to use the size that corresponded to a 5’10” rider, but some bike manufacturers could interpret this as a Medium while others interpret this as a Large.

  • Include bikes across multiple years | As bike trends slowly change, it’d be interesting to see how the data shifts across time. For example, the Rocky Mountain Element reduced its head angle from 70 degrees to 65.8 in one iteration of the bike. Including data from past years could provide valuable market insights into how the industry as a whole is moving.

Lessons Learned

As both authors of this report are in the midst of the Online Master of Science in Analytics (OMSA) program, we feel that this course reinforced previously seen data mining topics and provided good introductions to entirely new ones. Here are some individual takes on the lessons learned from the course.

Mike: I think this course was a great complement to ISYE 6740, Computational Data Analytics (CDA), which was more theory-focused, e.g. requiring us to build ML algorithms from scratch, while this course was more practice-focused, e.g. HW 5 allowing us to use machine learning methods on any dataset of our choice. Additionally, I think this was a good course to take concurrently with Data and Visual Analytics, as it also required the use of Random Forests, and it provided a good introduction to data analysis techniques that are standard across any project (e.g. cross validation).

Justin: I really appreciated the practical nature of this course. I’ve already had the opportunity to use some of the classification and machine learning algorithms at work. Although this is a different take on ISYE 6740, Computational Data Analytics (CDA), I appreciated getting to do a similar-style course in my programming language of choice (R) and to get to really dive into the actual code, which has already paid numerous dividends in a practical setting. I took this course concurrently with ISYE 6414, Regression Analysis. Although 6414 was recommended as a prerequisite for this course, it ended up being a nice pairing of courses.

Course Suggestions

This course was a good introduction to practical applications. Some areas of improvement we would like to see include:

  • Requiring more fundamental knowledge checks for some material. More specifically, we think the topics around Information Criterion and Discriminant analysis were glossed over too quickly. This course was our first exposure to those topics, so we’re not sure if they were intentionally light on the material since they are not used in industry as much.

  • Additionally, we think the assignments in this course could have been clearer; however, we also recognize this may have been intentional, to mimic the ambiguity of working in industry.

  • A lot of the quizzes ended up requiring (or at least seriously benefiting from) pre-written code based on the knowledge checks. Writing that code was fun and made the quizzes more bearable; however, we would have loved to see the instructor/TAs post this kind of code after the quiz due dates so students could learn how to solve the problems in R.